Abstract

In this paper we review the two main approaches to the problem of Malmquist bias 
which have been adopted in the cosmology literature, and show how these two formulations 
of the problem represent fundamentally different views of the nature of probability. We 
discuss the assumptions upon which both approaches are based and indicate some of 
their limitations. In particular we identify a basic flaw in the definition of homogeneous 
and inhomogeneous Malmquist corrections as they have frequently been applied in the 
literature, and indicate how this flaw may be corrected. 



1 Introduction 



In recent years the analysis of the peculiar velocity field using redshift-independent galaxy 
distance indicators has significantly enhanced our understanding of the formation and evolution 
of large scale structure. The most prevalent examples of these indicators have been the
Tully-Fisher (TF) and $D_n$-$\sigma$ relations, and their use in e.g. the POTENT analyses has contributed
to the solid body of evidence in support of coherent streaming motions over scales on the
order of 100 Mpc [?], [?] - evidence which, nevertheless, has attracted considerable controversy
in the literature, not least because of the difficulties which it presents for popular theories of 
structure formation. Much of this debate has focussed upon the statistical properties of the 
TF and $D_n$-$\sigma$ relations and the issue of how best to deal with the systematic errors which
arise when they are applied to surveys which are subject to observational selection. These 
systematic effects have been referred to generically in the literature as 'Malmquist bias'. There 
exists, however, a great deal of confusion over precisely what is meant by Malmquist bias (or 
the 'M word' as it has been labelled at this conference!). Different authors have used the term
'Malmquist bias' to denote different - and often contradictory - effects [?], [?], [?]. It is
not surprising, therefore, that no consensus has yet been reached about how best to eliminate 
Malmquist bias in studying the peculiar velocity field. 

In this paper we examine the statistical basis of the different approaches to the problem of 
Malmquist bias which have been adopted in the literature. We discuss the model assumptions 
upon which each depends, and the extent to which these assumptions may be generalised. We 
thus indicate how one may formulate a rigorous, consistent treatment of the problem of galaxy 
distance estimation and Malmquist bias. 



2 Unbiased distance estimators 

We have already heard from Brent Tully at this meeting a number of adjectives which he 
used to describe previous approaches to the problem of Malmquist bias in the literature. We 
can now add two further terms to this list: frequentist and Bayesian. Examples of where
the former approach has been adopted in recent literature include [8], [9], [?], [17], [20]: this
approach follows closely the original treatment of the statistical effect by Malmquist [13]. The
approach which we may categorise as Bayesian, on the other hand, has been adopted in e.g.
[?], [?], [?], and is, in fact, closer in spirit to early work by Eddington [5] on correcting
observational errors. (Indeed Lynden-Bell also refers to Eddington-Malmquist bias, recognising
the true origins of his statistical approach to the effect). Our adoption of the terms frequentist 
and Bayesian reflects the fact that at the heart of the difference between these two approaches 
lies the long-standing dichotomy between a Bayesian and a frequentist view of the nature of 
probability, as we will now explain. 



2.1 A 'frequentist' approach to bias 

The 'frequentist' view of the nature of probability is essentially based on the intuitively familiar 
idea that the probability of an event is a measure of the relative frequency of that event 
occurring in a large number of repeated experiments. In the limit as the number of experiments 
tends to infinity the frequency histogram reduces to the probability density function (pdf) of
a random variable. In the present context our 'experiment' is the estimation of the distance
to a given galaxy - e.g. by measuring its apparent magnitude and 21 cm line width in the case
of the TF relation. Crucial to the frequentist approach is the idea that the true distance of 
this galaxy is a fixed, though of course unknown, parameter - an 'unknown state of nature' in 
statistical parlance. 

We can state these ideas more rigorously as follows. Suppose we are estimating the distance
to a given galaxy which lies at true distance, $r_0$. Let $\hat{\mathbf{r}}$ denote a galaxy distance estimator. For
the TF relation $\hat{\mathbf{r}}$ is a function of apparent magnitude and 21 cm line width - i.e. $\hat{\mathbf{r}} = \hat{r}(\mathbf{m}, \mathbf{P})$.
(Here we have introduced several items of notation. We follow the standard statistical practice
of denoting a random variable by a bold face character, and an estimator of a parameter by a
caret. We also adopt $P$ as a shorthand for log(line width) - a notation used by several previous
authors [?], [23].) The precise form of this function depends upon the joint distribution of $\mathbf{m}$
and $\mathbf{P}$, and also upon how the TF relation is calibrated, as we discuss below.

Consider now the pdf, $p(\hat{r}|r_0)$, of $\hat{\mathbf{r}}$, conditional upon the true distance of the galaxy - the
parameter which we are estimating. $\hat{\mathbf{r}}$ is defined to be unbiased (c.f. [10], [14]) if the mean, or
expected, value of $\hat{\mathbf{r}}$ is equal to the true distance, $r_0$, of the galaxy. More generally the bias,
$B(\hat{\mathbf{r}}, r_0)$, of $\hat{\mathbf{r}}$ is defined as:-

$$B(\hat{\mathbf{r}}, r_0) = E(\hat{\mathbf{r}}|r_0) - r_0 = \int \hat{r}\, p(\hat{r}|r_0)\, d\hat{r} - r_0 \qquad (1)$$

Equation (1) immediately demonstrates the importance of finding an unbiased distance estimator
in this approach, since we see that the bias of $\hat{\mathbf{r}}$ is a function of the (unknown) true distance,
$r_0$. We should also note that we can define in an analogous fashion the bias of an estimator of
any parameter: in particular an estimator of the true log distance of a galaxy.
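To make the definition in equation (1) concrete, the following minimal Python sketch estimates the bias of an estimator by simulating many repeated 'experiments' at a fixed true distance. The gaussian error model for the log distance estimator, the dispersion of 0.1 dex, the true distance of 50 Mpc and the function name `monte_carlo_bias` are illustrative assumptions, not taken from any of the analyses reviewed here. Under these assumptions the log distance estimator is unbiased, while the corresponding distance estimator is not.

```python
import math
import random


def monte_carlo_bias(r0, sigma_dex, n_trials=200_000, seed=1):
    """Estimate the frequentist bias of equation (1) by repeated sampling.

    Assumption (illustrative only): log10(r_hat) is gaussian with mean
    log10(r0) and dispersion sigma_dex, so the log distance estimator is
    unbiased by construction.
    """
    rng = random.Random(seed)
    log_r0 = math.log10(r0)
    sum_log, sum_r = 0.0, 0.0
    for _ in range(n_trials):
        log_r_hat = rng.gauss(log_r0, sigma_dex)  # one repeated 'experiment'
        sum_log += log_r_hat
        sum_r += 10.0 ** log_r_hat
    bias_log = sum_log / n_trials - log_r0  # bias of the log distance estimator
    bias_r = sum_r / n_trials - r0          # bias of the distance estimator
    return bias_log, bias_r


bias_log, bias_r = monte_carlo_bias(r0=50.0, sigma_dex=0.1)

# Mean of a lognormal variable: E(r_hat | r0) = r0 * exp((sigma_dex * ln 10)^2 / 2)
expected_bias_r = 50.0 * (math.exp((0.1 * math.log(10.0)) ** 2 / 2.0) - 1.0)
print(bias_log, bias_r, expected_bias_r)
```

The simulated bias in distance grows in proportion to $r_0$ under this error model, which is one simple way of seeing why the bias in equation (1) is in general a function of the unknown true distance.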

Having thus defined what we mean by the bias of an estimator we now consider how this 
bias arises in estimating galaxy distances, taking again as our example the TF relation. 

2.2 Malmquist bias in the Tully-Fisher relation 

Let M denote the absolute magnitude of the given galaxy. Ignoring absorption and cosmological 
effects, the following equation holds:- 

$$m = M + 5 \log r_0 + 25 \qquad (2)$$

where the true distance, $r_0$, of the galaxy is in Mpc. From equation (2) an obvious estimator of
log distance is given by:-

$$\widehat{\log r} = 0.2(m - \hat{M} - 25) \qquad (3)$$

where $\hat{M}$ is some estimator of the galaxy's absolute magnitude. In early studies of the peculiar
velocity field [18], [19] a fixed fiducial value for $\hat{M}$ was adopted from prior considerations - i.e.
assuming that the observed galaxies were standard candles. The TF relation provides a better 
estimator of M by making use of the strong observed correlation between the luminosity and 
21 cm line width of spirals. The relation is usually calibrated by performing a linear regression
on a calibrating sample of galaxies whose distances are otherwise known. Thus we obtain a
linear relationship between $\hat{M}$ and $P$, viz:-

$$\hat{M} = \alpha P + \beta \qquad (4)$$

where $\alpha$ and $\beta$ are constants.
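As a small worked illustration of equations (2) to (4), the sketch below turns an apparent magnitude and a log line width into a log distance estimate. The calibration constants `ALPHA` and `BETA`, the input values and the function name `tf_log_distance` are hypothetical and chosen only so that the arithmetic is easy to follow; they do not correspond to any published calibration.

```python
def tf_log_distance(m, P, alpha, beta):
    """Log distance (distance in Mpc) from equations (3) and (4).

    m     : apparent magnitude
    P     : log(line width)
    alpha : slope of the calibrated TF relation, equation (4)
    beta  : zero point of the calibrated TF relation, equation (4)
    """
    M_hat = alpha * P + beta           # equation (4): estimated absolute magnitude
    return 0.2 * (m - M_hat - 25.0)    # equation (3)


# Hypothetical calibration, for illustration only.
ALPHA, BETA = -7.5, -1.25

log_r_hat = tf_log_distance(m=15.0, P=2.5, alpha=ALPHA, beta=BETA)
r_hat_mpc = 10.0 ** log_r_hat
print(log_r_hat, r_hat_mpc)
```

With these numbers the estimated absolute magnitude is -20, so that equation (2) places the galaxy at a distance modulus of 35, i.e. at 100 Mpc.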

The choice of which linear regression is most appropriate is non-trivial, however, particularly 
when one's survey is subject to observational selection effects. We can illustrate this with the 
following simple example. Suppose that the intrinsic joint distribution of absolute magnitude 
and log(line width) is a bivariate normal. The left hand panel of Figure 1 shows schematically
the scatter in the TF relation in this case, for a calibrating sample which is free from selection 
effects - e.g. a nearby cluster. (More precisely, the ellipse shown is an isoprobability contour 
enclosing a given confidence region for M and P). The solid and dotted lines show the linear 
relationship obtained by regressing line widths on magnitudes and magnitudes on line widths 
respectively. Thus the dotted line is defined as the expected value of M at given P, while the 
solid line is defined as the expected value of P at given M. Since in practice one wishes to 
infer the value of M from the measured value of P, the M on P regression is often referred to 
in the literature as defining the 'direct' TF relation, while using the P on M regression defines 
the 'inverse' TF relation. For the bivariate normal case the equations of the direct and inverse
regression lines are as follows:-

$$E(M|P) = M_0 + \rho \frac{\sigma_M}{\sigma_P} (P - P_0) \qquad (5)$$

$$E(P|M) = P_0 + \rho \frac{\sigma_P}{\sigma_M} (M - M_0) \qquad (6)$$

where $M_0$, $P_0$, $\sigma_M$, $\sigma_P$ and $\rho$ denote the means, dispersions and correlation coefficient of the
bivariate normal distribution of $M$ and $P$. Both regression lines can be written in the form of
equation (4), thus defining $\hat{M}$ as a function of $P$, although of course the constants $\alpha$ and $\beta$ will
be different in each case. Moreover the definition of $\hat{M}$ is subtly different in each case. For
the direct regression $\hat{M}$ is identified as the mean absolute magnitude at the observed log line
width. For the inverse regression on the other hand $\hat{M}$ is defined such that the observed log
line width is equal to its expected value when $M = \hat{M}$. Consequently, as is apparent from their
slopes, the direct and inverse regression lines give rise to markedly different distance estimators, 
although it is straightforward to show that in the absence of selection effects both estimators 
are unbiased. 
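The way in which equations (5) and (6) each define constants for equation (4) can be sketched as follows. The direct coefficients are read off from equation (5); the inverse coefficients follow from solving equation (6) for $M$ at the observed $P$. The function names and the numerical parameter values are illustrative assumptions only.

```python
def direct_tf_coefficients(M0, P0, sigma_M, sigma_P, rho):
    """(alpha, beta) of equation (4) from the direct regression, equation (5)."""
    alpha = rho * sigma_M / sigma_P
    return alpha, M0 - alpha * P0


def inverse_tf_coefficients(M0, P0, sigma_M, sigma_P, rho):
    """(alpha, beta) of equation (4) from the inverse regression.

    Equation (6) is solved for M, i.e. M_hat is the magnitude at which the
    observed P equals its expected value.
    """
    alpha = sigma_M / (rho * sigma_P)
    return alpha, M0 - alpha * P0


# Hypothetical bivariate normal parameters, for illustration only.
params = dict(M0=-20.0, P0=2.5, sigma_M=1.0, sigma_P=0.2, rho=-0.9)

alpha_dir, beta_dir = direct_tf_coefficients(**params)
alpha_inv, beta_inv = inverse_tf_coefficients(**params)
print(alpha_dir, beta_dir)
print(alpha_inv, beta_inv)
```

Both lines pass through the point $(P_0, M_0)$, and the ratio of the direct to the inverse slope is $\rho^2$, so the two estimators coincide only in the limit of a perfect correlation.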

The situation is very different when we include the effects of observational selection, however. 
This is illustrated in the right hand panel of Figure 1, which shows the scatter in the TF relation
in a calibrating sample subject to a sharp cut-off in absolute magnitude - as would be the case 
in e.g. a distant cluster observed in an apparent magnitude-limited survey. We can see that in 
this case the slope of the direct regression of M on P is substantially changed from that in the 
nearby cluster - indeed the direct regression is no longer linear at all. This means that if one 
calibrates the TF relation in the nearby cluster using the direct regression and then applies this 
relation to the more distant cluster, one will systematically underestimate its distance, since
the expected value of $M$ given $P$ in the distant cluster is systematically brighter than that
in the nearby cluster as fainter galaxies progressively 'fade out' due to the magnitude limit.
The corresponding 'direct', or '$M$ on $P$', log distance estimator - obtained by substituting the
appropriate constants into equation (4) and then equation (3) - will therefore be negatively biased.
This is precisely analogous to the bias identified by Malmquist [13] in considering the mean
absolute magnitude of standard candles in a sample with a sharp apparent magnitude limit.

Figure 1: Schematic Tully-Fisher relations for the case of a nearby, completely sampled, cluster
and a distant cluster subject to a sharp selection limit on absolute magnitude.

2.3 Properties of 'Schechter' estimators



In an important paper Schechter [21] observed that the slope of the inverse regression line is
unchanged, irrespective of the completeness of one's sample, provided that the selection effects
are in magnitude only. We can see that this observation is valid in the simple case considered in
Figure 1. In other words the expected value of $P$ given $M$ is unaffected by the Malmquist effect
and, therefore, defines an unbiased log distance estimator. Although the unbiased property of
the inverse regression line has been generally recognised (c.f. [17], [?], [12]), its ramifications
for estimating galaxy distances have not been fully appreciated.

We have carried out an extension of Schechter's ideas to more realistic situations [?], and
examined the extent to which the assumptions upon which they are based may be generalised. 
We now briefly summarise the properties of unbiased 'Schechter' estimators. 

1. In a sample subject to observational selection effects, it is possible to define a linear
estimator of log distance which is unbiased at all true distances - provided that the
following two conditions are met: the measurements of line width are free from selection
effects and the conditional expectation of log(line width) at given absolute magnitude
is linear in $M$. Moreover, the appropriate linear combination of $M$ and $P$ corresponds
exactly to the estimator derived from the inverse regression of line widths on magnitudes,
as prescribed in [21].

2. The 'Schechter' estimator is the only unbiased linear estimator. Any other linear
combination of magnitude and line width, and in particular any other regression line, yields
an estimator which is biased at all true distances for a magnitude selection function.
Examples of biased regression lines in this case include not only the direct or '$M$ on $P$'
regression shown above (c.f. [12], [?]) but also the orthogonal [?], bisector [17] and mean
[15] regression lines.



3. The shape of the pdf, and hence in particular the variance of the Schechter estimator, is 
constant at all true distances and is in fact identical to that of the intrinsic conditional 
distribution of P given M. This conditional distribution is frequently modelled to be 
gaussian, thus implying that the Schechter estimator is gaussian in this case. Again the 
Schechter estimator is unique in this regard - for any other general linear estimator the 
shape of its distribution becomes distorted at large true distances, as the effects of selection 
become significant. It follows from this property that confidence intervals derived from 
the Schechter estimator will have constant width [?].

4. The unbiasedness of the Schechter estimator holds for an arbitrary luminosity function. 
This is a particularly useful property, since the bias of any other linear estimator will 
depend explicitly upon the form of the luminosity function, so that any attempt to remove 
the bias would necessarily be model dependent [?].

5. One may also define an unbiased log distance estimator for other distance indicators, 
including the $D_n$-$\sigma$ and magnitude-colour relations, subject to the same condition that
there be one observable free from selection, but not requiring one observable to be 
distance-independent. In a diameter-complete survey, for example, one may construct an 
unbiased distance estimator from the observed angular diameter and apparent magnitude. 
As above, it is easy to show that this unbiased estimator corresponds exactly to the 
regression of the selection-free observable upon the other observable. 

6. If there is no selection-free observable, then an unbiased estimator cannot be defined as 
a simple linear combination of the observables. In the context of both the Tully-Fisher 
and $D_n$-$\sigma$ relations, however, this is somewhat less of a problem than one might expect.
Most surveys will be subject to a lower selection limit on line width or velocity dispersion. 
This selection limit becomes increasingly less important at larger true distances, however. 
This is easy to understand, since at large distances only intrinsically brighter (or larger),
and thus sufficiently large line width, galaxies will be observable [?]. It is found that at
cosmologically interesting distances, the Schechter estimator is still effectively unbiased 
in this case. 
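The behaviour of the direct and inverse regressions under magnitude selection, as described in Section 2.2 and in properties 1 and 2 above, can be checked with a small simulation. The sketch below draws a bivariate normal sample of $(M, P)$, applies a sharp cut in absolute magnitude, and compares the fitted slopes before and after the cut. All parameter values, the sample size and the helper names `slope` and `simulate_tf` are illustrative assumptions; no line width selection is applied.

```python
import math
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_tf(n, M0, P0, sigma_M, sigma_P, rho, seed=2):
    """Draw n pairs (M, P) from a bivariate normal distribution."""
    rng = random.Random(seed)
    M, P = [], []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        M.append(M0 + sigma_M * u)
        P.append(P0 + sigma_P * (rho * u + math.sqrt(1.0 - rho ** 2) * v))
    return M, P


# Hypothetical parameters, for illustration only.
M, P = simulate_tf(100_000, M0=-20.0, P0=2.5, sigma_M=1.0, sigma_P=0.2, rho=-0.9)

# Sharp selection limit in absolute magnitude: keep only galaxies brighter than M_LIMIT.
M_LIMIT = -20.0
selected = [(a, b) for a, b in zip(M, P) if a < M_LIMIT]
M_cut = [a for a, _ in selected]
P_cut = [b for _, b in selected]

direct_full = slope(P, M)           # M on P, complete sample
direct_cut = slope(P_cut, M_cut)    # M on P, magnitude-selected sample
inverse_full = slope(M, P)          # P on M, complete sample
inverse_cut = slope(M_cut, P_cut)   # P on M, magnitude-selected sample

print("direct :", direct_full, direct_cut)
print("inverse:", inverse_full, inverse_cut)
```

In this toy case the inverse slope is unchanged by the cut to within sampling noise, while the direct slope is substantially altered.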



2.4 A Bayesian approach to unbiased distance estimators 

We now consider the second, essentially Bayesian, approach to the problem of Malmquist bias,
adopted in e.g. [?], [12], and [?]. The crucial difference in this approach is that the true
distance, $r_0$, of a galaxy is itself regarded as a random variable. Hence one must assign a prior
probability distribution for $r_0$, based upon an assumed spatial density distribution and selection
function. Following the measurement of some distance estimator, $\hat{\mathbf{r}}$, for each galaxy - using e.g.
the TF or $D_n$-$\sigma$ relation - one can define a posterior distribution for $r_0$ which will differ from
the prior. It is the properties of this posterior distribution which are considered in defining an
estimator as unbiased. We can use Bayes' theorem to derive an expression for the posterior
distribution, $p(r_0|\hat{r})$, viz:-

$$p(r_0|\hat{r}) = \frac{p(\hat{r}|r_0)\, p(r_0)}{\int p(\hat{r}|r_0)\, p(r_0)\, dr_0} \qquad (7)$$

Here $p(r_0)$ is the prior distribution for $r_0$ and $p(\hat{r}|r_0)$ is known as the likelihood function, and
is in fact simply the pdf of our distance estimator, $\hat{\mathbf{r}}$, as discussed previously.

In this approach the distance estimator, $\hat{\mathbf{r}}$, is defined as unbiased if the expected value of
$r_0$ with respect to this posterior distribution, $p(r_0|\hat{r})$, is equal to $\hat{r}$. In general the bias of $\hat{\mathbf{r}}$ is
defined as:-

$$B(\hat{\mathbf{r}}, r_0) = E(r_0|\hat{r}) - \hat{r} = \int r_0\, p(r_0|\hat{r})\, dr_0 - \hat{r} \qquad (8)$$

Compare this expression with equation (1) above. By assuming a prior distribution and likelihood
function we can see from equations (7) and (8) that one may derive a Malmquist correction to
remove the bias of a 'raw' distance estimator, so that the corrected estimator is unbiased.

Of course one may formulate this approach for the corresponding unbiased estimator of log 
distance in an analogous fashion. Lynden-Bell et al. [12] do precisely this, assuming a prior
distribution which corresponds to a constant spatial number density of galaxies and assuming 
for their raw log distance estimator a gaussian pdf of mean value equal to the true log distance 
and of constant variance. These assumptions imply a constant, or homogeneous Malmquist 
correction: in other words all raw distance estimates are rescaled by a constant factor. Clearly 
this assumed prior will be incorrect - due to both galaxy clustering and observational selection 
effects - although it might be regarded as a reasonable first approximation. 
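The constancy of the homogeneous correction can be verified numerically from equations (7) and (8). The sketch below assumes a prior $p(r_0) \propto r_0^2$ (constant number density, no selection) and a raw estimator whose natural log is gaussian about $\ln r_0$ with dispersion `DELTA`; the value of `DELTA`, the grid settings and the function name `posterior_mean_distance` are illustrative assumptions. Under these assumptions the posterior mean works out analytically to $\hat{r} \exp(3.5\Delta^2)$, which the numerical sum reproduces.

```python
import math


def posterior_mean_distance(r_hat, delta, prior, n_grid=4001, n_sigma=10.0):
    """E(r0 | r_hat) from equations (7) and (8), summed on a uniform grid in ln r0.

    Assumptions (illustrative only): ln(r_hat) is gaussian about ln(r0) with
    dispersion delta; prior(r0) is an unnormalised prior density in r0.
    Factors of the likelihood that do not depend on r0 cancel in the ratio.
    """
    d = math.log(r_hat)
    x_min = d - n_sigma * delta
    dx = 2.0 * n_sigma * delta / (n_grid - 1)
    num = den = 0.0
    for i in range(n_grid):
        x = x_min + i * dx                       # x = ln r0
        r0 = math.exp(x)
        likelihood = math.exp(-0.5 * ((d - x) / delta) ** 2)
        weight = likelihood * prior(r0) * r0     # extra r0 because dr0 = r0 dx
        num += r0 * weight
        den += weight
    return num / den


def homogeneous_prior(r0):
    """Constant spatial number density: p(r0) proportional to r0^2."""
    return r0 ** 2


DELTA = 0.2  # hypothetical dispersion in ln(distance)

factor_near = posterior_mean_distance(40.0, DELTA, homogeneous_prior) / 40.0
factor_far = posterior_mean_distance(80.0, DELTA, homogeneous_prior) / 80.0
analytic_factor = math.exp(3.5 * DELTA ** 2)
print(factor_near, factor_far, analytic_factor)
```

The rescaling factor is the same at both raw distances, which is what is meant by a homogeneous correction; any departure of the prior from $r_0^2$ breaks this property.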

Landy and Szalay [11] present an improved treatment by explicitly recognising the Bayesian
nature of this problem and proposing that one use the observed distribution of raw distance
estimates to provide a better approximation to the prior distribution of log true distance. In 
principle this prior would take account of the effects of clustering and selection which render 
the observed distribution inhomogeneous - thus leading to the definition of an inhomogeneous 
Malmquist correction. They still assume, however, that the pdf of their raw log distance 
estimator is gaussian, with constant variance and mean equal to the true log distance. 
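To illustrate why a non-uniform prior leads to a correction which varies from galaxy to galaxy, the sketch below evaluates the posterior mean of log distance for a hypothetical density model containing a single enhancement (a 'cluster') at 50 Mpc. This is not the procedure of [11], which estimates the prior from the observed distribution of raw distance estimates; here the density model, its parameters and the function names are simply assumed for illustration, with the same gaussian likelihood in natural log distance as is assumed for the homogeneous case.

```python
import math


def log_distance_correction(r_hat, delta, density, n_grid=4001, n_sigma=8.0):
    """E(ln r0 | r_hat) - ln(r_hat) for a prior proportional to density(r0) * r0^2.

    Assumption (illustrative only): ln(r_hat) is gaussian about ln(r0) with
    dispersion delta. The sum is taken on a uniform grid in x = ln r0.
    """
    d = math.log(r_hat)
    x_min = d - n_sigma * delta
    dx = 2.0 * n_sigma * delta / (n_grid - 1)
    num = den = 0.0
    for i in range(n_grid):
        x = x_min + i * dx
        r0 = math.exp(x)
        likelihood = math.exp(-0.5 * ((d - x) / delta) ** 2)
        weight = likelihood * density(r0) * r0 ** 3   # n(r0) r0^2 dr0, with dr0 = r0 dx
        num += x * weight
        den += weight
    return num / den - d


def uniform(r0):
    """Constant number density."""
    return 1.0


def with_cluster(r0):
    """Hypothetical density: uniform background plus an enhancement at 50 Mpc."""
    return 1.0 + 10.0 * math.exp(-0.5 * ((r0 - 50.0) / 3.0) ** 2)


DELTA = 0.15  # hypothetical dispersion in ln(distance)

homogeneous = log_distance_correction(40.0, DELTA, uniform)    # analytic value: 3 * DELTA^2
in_front = log_distance_correction(40.0, DELTA, with_cluster)  # raw estimate in front of the cluster
behind = log_distance_correction(62.0, DELTA, with_cluster)    # raw estimate behind the cluster
print(homogeneous, in_front, behind)
```

In this toy model raw estimates on the near side of the enhancement are corrected outwards by more than the homogeneous value, while those on the far side are pulled back towards it, so the correction depends on position.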



3 Comparing the two approaches 



One might regard the differences between the two approaches we have outlined for interpreting, 
and dealing with, the effects of Malmquist bias as of no more than semantic importance.
This is far from the case, however. Firstly it is worth pointing out that - however valid the
two approaches may be when considered individually - viewed together they are mutually 
inconsistent. In other words an estimator defined as unbiased in the frequentist sense must 
always be biased in the Bayesian sense, and vice versa. Hence any analysis which is not 
self-consistent in its approach to bias throughout will in general end up deriving estimators which are
unbiased in neither the frequentist nor the Bayesian sense!
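The inconsistency between the two definitions can be seen in one simple simulated case. The sketch below scatters galaxies with uniform number density inside a sphere, assigns each a raw log distance estimate with gaussian error, and then averages the error in two ways: conditioning on the true distance, as in equation (1), and conditioning on the estimate, as in equation (8). The sample size, the dispersion `DELTA`, the radius `R_MAX` and the bin in which the averages are taken are illustrative assumptions, and natural logs are used throughout.

```python
import math
import random

DELTA = 0.2      # dispersion of ln(r_hat) about ln(r0); hypothetical
R_MAX = 100.0    # radius of the simulated volume, arbitrary units
N = 400_000
CENTRE = math.log(40.0)   # bin centre in ln(distance), well inside the volume
HALF_WIDTH = 0.1

rng = random.Random(3)
freq_sum = bayes_sum = 0.0
freq_n = bayes_n = 0
for _ in range(N):
    # Uniform number density inside a sphere: P(r0 < r) proportional to r^3.
    r0 = R_MAX * (1.0 - rng.random()) ** (1.0 / 3.0)
    x = math.log(r0)             # true log distance
    d = rng.gauss(x, DELTA)      # raw log distance estimate
    if abs(x - CENTRE) < HALF_WIDTH:   # condition on the TRUE distance
        freq_sum += d - x
        freq_n += 1
    if abs(d - CENTRE) < HALF_WIDTH:   # condition on the ESTIMATED distance
        bayes_sum += x - d
        bayes_n += 1

frequentist_bias = freq_sum / freq_n   # E(d | x) - x
bayesian_bias = bayes_sum / bayes_n    # E(x | d) - d
print(frequentist_bias, bayesian_bias, 3.0 * DELTA ** 2)
```

The same raw estimator is unbiased by the first measure and biased by approximately $3\Delta^2$ by the second, the value expected for a uniform density prior with this error model.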

Another important difference between the two approaches stems from the dependence of the 
Bayesian description upon the assumption of a prior true distance distribution. This results in
a different Malmquist correction for field and cluster samples [12]. This dichotomy can lead to
ambiguity in the case where cluster membership is unresolved - as is frequently the case with
spiral galaxies. No such difficulty exists with the frequentist description, however. Since the
bias of a distance estimator is defined conditionally upon the true (log) distance, it is easy to
show [?] that the definition of an unbiased distance estimator in this approach is completely
independent of the local galaxy number density. It would seem that this important distinction 
has not been well appreciated in the literature. 

In section 2.4 we noted that the homogeneous and inhomogeneous Malmquist corrections were
both derived on the assumption that our raw log distance estimator has a gaussian pdf with
mean value equal to the true log distance. In other words this means that the raw log distance
estimator is assumed to be unbiased - according to the frequentist definition of equation (1). As
we saw in section 2.3, only the Schechter log distance estimator, corresponding to the inverse
regression of line widths on magnitudes in the case of the TF relation, has this property -
and even then only when one's sample is free from line width selection effects. If any other
regression line is used to derive the raw distance estimator, then the assumptions underlying the
definition of homogeneous and inhomogeneous Malmquist corrections will no longer be valid, and
the Malmquist corrections derived from the raw distance estimator will not remove the effects 
of Malmquist bias - even in the (unlikely!) case where the prior true distance distribution is 
known exactly. 

In short, therefore, Malmquist corrections must be computed using a Schechter log distance 
estimator as one's raw estimator if they are to be in any way effective. This crucial result has 
not been recognised in the literature, and to our knowledge every application of homogeneous 
and inhomogeneous Malmquist corrections has been carried out using a biased raw log distance
estimator (c.f. [?], [?], [?], [?], [?]).

We examine elsewhere in these proceedings [16] the specific implications of this important
result for velocity field reconstruction techniques - in particular the POTENT method [?].



4 Summary 

In this paper we have seen how one may address the problem of Malmquist bias in two distinctly 
different ways, essentially reflecting a frequentist and Bayesian view of the nature of probability. 
We have shown how, following either approach, one may in principle derive unbiased distance 
estimators and have discussed the assumptions upon which this result holds. In the frequentist
approach the unbiased distance estimator corresponds to the inverse regression of line widths
on magnitudes - as prescribed by Schechter [21]. In the Bayesian picture unbiased estimators
are defined by computing the appropriate Malmquist correction to a raw distance estimator,
assuming a prior distribution for true distance. We have thus identified a serious error in the 
application of homogeneous and inhomogeneous Malmquist corrections in the literature, since 
these have been computed with a biased 'raw' distance estimator - violating a basic assumption 
in their definition. We have indicated how one may compute the proper Malmquist corrections 
by using the Schechter estimator as one's raw distance estimator. 

The question of which of these two approaches to the problem of Malmquist bias is best has 
no clear-cut answer. We discuss some of the issues in more detail elsewhere [?], [?]. It suffices
to say here that the main requirement of any statistical analysis of galaxy distance estimates 
is to be consistent. One should point out, however, that there are often circumstances where 
one does not have complete freedom to choose either the frequentist or Bayesian approach. 

A common example of how this difficulty can arise is when one's velocity field data must
be heavily smoothed - as is the case with the POTENT method [?]. POTENT requires the
computation of a smoothed radial peculiar velocity field at all points on a spherical grid, and
accomplishes this by using very large smoothing windows, of effective radius $\sim 5000$ km s$^{-1}$. In
interpolating a peculiar velocity from galaxies appearing in the catalogue to a given spatial grid
point, the essential effect of the smoothing window is to pick out the galaxy whose estimated
position lies closest to the prescribed grid point. The actual distance of this galaxy may be
radically different from its estimated distance, and will depend upon the true spatial distribution
of galaxies. In requiring that the mean smoothed radial peculiar velocity be equal to the true
radial velocity at that point, one finds that we require equation (8) to vanish - i.e. we want a
distance estimator which is unbiased according to the Bayesian description, thus requiring the
application of inhomogeneous Malmquist corrections. Of course, as we have pointed out above,
these corrections will be seriously inadequate if one does not use a Schechter raw distance
estimator - a fact which does not appear to have been realised by the POTENT authors [?], [?].

We discuss elsewhere in some detail the effects of bias and smoothing procedures on velocity
field reconstructions with POTENT [16], [22].



References 

[1] Bertschinger, E., Dekel, A., Faber, S., Dressler, A., & Burstein, D. 1990. Astrophys. J.
364, 370

[2] Bicknell, G. 1992. Astrophys. J. 399, 1 

[3] Dekel, A., Bertschinger, E., Yahil, A., Strauss, M. A., Davis, M., & Huchra, J. P. 1993. 
Astrophys. J. 412, 1 

[4] Dressler, A., & Faber, S. 1990. Astrophys. J. 354, 13

[5] Eddington, A. 1940. Mon. Not. R. astr. Soc 100, 354 

[6] Giraud, E. 1987. Astr. Astrophys. 174, 23 

[7] Hendry, M. A. 1992. Ph.D. Thesis Univ. of Glasgow 



[8] Hendry, M. A., & Simmons, J. F. L. 1990. Astr. Astrophys. 237, 275 

[9] Hendry, M. A., & Simmons, J. F. L. 1993. preprint 
[10] Kendall, M., & Stuart, A. 1963. The Advanced Theory of Statistics, vol 1, Hafner (NY)
[11] Landy, S., & Szalay, A. 1992. Astrophys. J. 391, 494 
[12] Lynden-Bell, D., et al. 1988. Astrophys. J. 326, 19 
[13] Malmquist, K. 1920. Medd. Lund. Astron. Obs. 22, 1

[14] Mood, A. M., & Graybill, A. F. 1974. Introduction to the Theory of Statistics, McGraw-Hill 
(NY) 

[15] Mould, J., et al. 1993. Astrophys. J. 409, 14 

[16] Newsam, A. M., Simmons, J. F. L., & Hendry, M. A. 'Bias in Velocity Field Recoveries' 
Cosmic Velocity Fields, Editions Frontieres, in press 

[17] Pierce, M., & Tully, R. B. 1988. Astrophys. J. 330, 579 

[18] Rubin, V. C., Thonnard, N., Ford, W. K., & Roberts, M. S. 1976. Astron. J. 81, 719
[19] Sandage, A., & Tammann, G. A. 1975. Astrophys. J. 196, 313 
[20] Sandage, A., & Tammann, G. A. 1990. Astrophys. J. 363, 1 
[21] Schechter, P. 1980. Astron. J. 85, 801 

[22] Simmons, J. F. L., Newsam, A. M., & Hendry, M. A. 'POTENT and Max-Flow algorithms'
Cosmic Velocity Fields, Editions Frontieres, in press 

[23] Teerikorpi, P. 1984. Astr. Astrophys. 141, 407 
[24] Tully, R. B. 1988. Nature 334, 209 



